Injecting Structured Data to Generative Topic Model in Enterprise Settings

Authors

  • Han Xiao
  • Xiaojie Wang
  • Chao Du
Abstract

Enterprises have steadily accumulated both structured and unstructured data as computing resources improve. However, previous research on enterprise data mining often treats these two kinds of data independently and overlooks their mutual benefits. We explore an approach to incorporating a common type of structured data (the organigram) into a generative topic model. Our approach, the Partially Observed Topic model (POT), considers not only the unstructured words but also the structured information in its generative process. By integrating the structured data implicitly, the topic mixture of a document is partially observed during the Gibbs sampling procedure. This allows POT to learn topics in a targeted, directed way, which makes it easy to tune and suitable for end-user applications. We evaluate the proposed model on a real-world dataset and show improved expressiveness over traditional LDA. In the task of document classification, POT also demonstrates more discriminative power than LDA.
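The idea of partially observing a document's topic mixture during Gibbs sampling can be sketched as follows. This is a minimal illustrative sketch, not the POT paper's actual implementation: the counts, hyperparameters, and the rule that maps structured metadata to an `allowed` topic set are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, n_words = 4, 6
alpha, beta = 0.1, 0.01

# Counts maintained by a collapsed Gibbs sampler (toy initialization).
n_dk = np.zeros(n_topics)            # topic counts within this document
n_kw = np.ones((n_topics, n_words))  # topic-word counts
n_k = n_kw.sum(axis=1)               # total words assigned to each topic

def sample_topic(w, allowed):
    """Resample the topic of word w, restricted to the `allowed` set.

    The restriction stands in for structured metadata (e.g. the author's
    unit in an organigram) partially observing the document's topics.
    """
    # Standard collapsed-Gibbs conditional for LDA, up to normalization.
    p = (n_dk + alpha) * (n_kw[:, w] + beta) / (n_k + n_words * beta)
    mask = np.zeros(n_topics)
    mask[list(allowed)] = 1.0        # zero out disallowed topics
    p = p * mask
    p /= p.sum()
    return rng.choice(n_topics, p=p)

# A document whose structured metadata limits it to topics {0, 2}.
z = sample_topic(w=3, allowed={0, 2})
assert z in {0, 2}
```

Masking the conditional in this way is one simple reading of "partially observed": the sampler still mixes over topics, but only within the subset the structured data permits.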


Similar resources

Inter-battery Topic Representation Learning

In this paper, we present the Inter-Battery Topic Model (IBTM). Our approach extends traditional topic models by learning a factorized latent variable representation. The structured representation leads to a model that marries the benefits traditionally associated with a discriminative approach, such as feature selection, with those of a generative model, such as principled regularization and abili...


Modeling language and cognition with deep unsupervised learning: a tutorial overview

Deep unsupervised learning in stochastic recurrent neural networks with many layers of hidden units is a recent breakthrough in neural computation research. These networks build a hierarchy of progressively more complex distributed representations of the sensory data by fitting a hierarchical generative model. In this article we discuss the theoretical foundations of this approach and we review...


On-Line Labeled Topic Model

A large number of electronic documents are labeled with human-interpretable annotations. High-efficiency text mining on such data sets requires a generative model that can flexibly comprehend the significance of observed labels while simultaneously uncovering topics within unlabeled documents. This paper presents a novel and generalized on-line labeled topic model (OLT) tracking the time developm...


Experiments with Latent Dirichlet Allocation

Latent Dirichlet Allocation is a generative topic model for text. In this report, we implement collapsed Gibbs sampling to learn the topic model. We test our implementation on two data sets: classic400 and Psychological Abstract Review. We also discuss different evaluations of the models' goodness-of-fit and how parameter settings interact with it.


Probabilistic Topic Modeling in Multilingual Settings: A Short Overview of Its Methodology and Applications

Probabilistic topic models are unsupervised generative models that model document content as a two-step generation process, i.e., documents are observed as mixtures of latent topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingual to multilingua...
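The two-step generation process described above can be sketched directly with LDA's standard formulation: topics are drawn as distributions over the vocabulary, each document draws a topic mixture, and each word picks a topic and then a word from it. Sizes and hyperparameter values below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

n_topics, vocab_size, doc_len = 3, 8, 10
alpha = np.full(n_topics, 0.5)   # Dirichlet prior over topic mixtures
eta = np.full(vocab_size, 0.1)   # Dirichlet prior over topic-word dists

# Each topic is a probability distribution over vocabulary words.
phi = rng.dirichlet(eta, size=n_topics)

# Step 1: the document is a mixture of latent topics.
theta = rng.dirichlet(alpha)

# Step 2: each word draws a topic from the mixture, then a word
# from that topic's distribution.
doc = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=theta)
    w = rng.choice(vocab_size, p=phi[z])
    doc.append(w)

assert len(doc) == doc_len
assert all(0 <= w < vocab_size for w in doc)
```

The multilingual extensions surveyed in the overview keep this same two-step structure while sharing or linking the topic-word distributions across languages.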



Publication date: 2009